Adding `Doc.content_hash` and integrating for unique `DocDetails.doc_id` #1029
base: main
Conversation
Pull Request Overview
This PR adds content hashing capabilities to enable unique document identification for files that share metadata but have different content, such as main papers and their supplemental information. The implementation adds a `content_hash` field to the `Doc` class and integrates it into the `DocDetails.doc_id` generation to create composite keys based on both DOI and content hash.
- Adds `content_hash` field to `Doc` class with auto-population from file contents
- Updates `DocDetails.doc_id` to be a composite key of DOI and content hash via new `compute_unique_doc_id` function
- Enhances test coverage with scenarios for duplicate detection across files with shared metadata but different content
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_paperqa.py | Expands duplicate detection tests to validate content hash differentiation |
| src/paperqa/utils.py | Adds `compute_unique_doc_id` function for generating composite document IDs |
| src/paperqa/types.py | Adds `content_hash` field to `Doc` and updates `DocDetails` validation/merging logic |
| src/paperqa/llms.py | Updates `Doc` instantiation to include content hash field |
| src/paperqa/docs.py | Integrates content hash computation during document addition |
| src/paperqa/clients/__init__.py | Updates metadata client to handle content hash field during doc upgrades |
Comments suppressed due to low confidence (1)
tests/test_paperqa.py:1068

- The test expects specific hardcoded dockey values, but these appear to be content hash-based values that could change if the document content or hashing logic changes. Consider using a more flexible assertion that validates the dockey format or length rather than exact values.

```python
assert doc_details.dockey in {"8ce7ddba9c9dcae6", "a353fa2478475c9c"}
```
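One way to realize the suggested looser assertion (hypothetical; assumes dockeys are 16 lowercase hex characters, as the hardcoded example values suggest):

```python
import re

# Instead of pinning exact dockey values, validate the format:
# 16 lowercase hex characters, per the example values in the test.
DOCKEY_PATTERN = re.compile(r"[0-9a-f]{16}")

dockey = "8ce7ddba9c9dcae6"  # example value taken from the test above
assert DOCKEY_PATTERN.fullmatch(dockey) is not None
```

This keeps the test green across content or hashing changes while still catching malformed IDs.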
```python
if doi:
    value_to_encode: str = doi.lower() + (content_hash or "")
else:
    value_to_encode = content_hash or str(uuid4())
```
Using `str(uuid4())` as a fallback when both DOI and content_hash are None could lead to non-deterministic document IDs for the same document across different runs. Consider if this is the intended behavior or if a more deterministic fallback would be appropriate.

Suggested change:

```diff
-value_to_encode = content_hash or str(uuid4())
+value_to_encode = content_hash or "default"
```
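The non-determinism being flagged is easy to see in isolation: `uuid4()` returns a fresh random value on every call, so any ID derived from it differs across runs for the same document.

```python
from uuid import uuid4

# Two calls produce (virtually always) different values, so an ID derived
# from uuid4() is not reproducible across runs for the same document.
first_run_id = str(uuid4())
second_run_id = str(uuid4())
assert first_run_id != second_run_id
```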
We recompute `doc_id` if there are changes in `DocDetails.__add__`, so I believe we're safe
```python
dockey_is_content_hash = False
if dockey is None:
    # md5 sum of file contents (not path!)
```
I moved this comment to `Doc.content_hash`'s description, fyi
```python
# note we have some extra fields which may have come from reading the doc text,
# but aren't in the doc object, we add them here too.
extra_fields = {
```
The usage of "extra" here is an overloaded term:

- Extra in a Pydantic context (and we have `DocDetails.other`, to add to the confusion)
- Extra in a query context

Paired with vagueness such as "some extra fields", I decided to rename and reword this whole section
```python
), "Expected citation to be inferred"
assert shorter_flag_day.content_hash
assert flag_day.content_hash != shorter_flag_day.content_hash
assert flag_day.doc_id != shorter_flag_day.doc_id
```
This whole PR basically enables this singular assertion to pass
Force-pushed from e685706 to 542eb13
```diff
-elif "doc_id" not in data or not data["doc_id"]:  # keep user defined doc_ids
-    data["doc_id"] = encode_id(uuid4())
+if not data.get("doc_id"):  # keep user defined doc_ids
+    data["doc_id"] = compute_unique_doc_id(doi, data.get("content_hash"))
```
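The condition refactor works because `dict.get` is falsy for both a missing key and an empty value, collapsing the old two-part check into one:

```python
# Old vs. new emptiness check for user-defined doc_ids: equivalent for
# missing keys, empty strings, None, and populated values alike.
for data in ({}, {"doc_id": ""}, {"doc_id": None}, {"doc_id": "user-defined"}):
    old_check = "doc_id" not in data or not data["doc_id"]
    new_check = not data.get("doc_id")
    assert old_check == new_check
```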
we talked about this in person, but let's move this into an optional feature
I piloted this today some more, btw. Moving `compute_unique_doc_id` into something like `Settings.details_id_factory` requires us to pass a throwaway `Settings` reference to all `DocDetails` constructions, which is both awkward and error prone.

Perhaps a less awkward route is having a `bool` environment variable `PQA_USE_OLD_ID_FACTORY` that gets used inside `compute_unique_doc_id`
Force-pushed from 542eb13 to b614322
I need to read this in more detail and think about it
Force-pushed from b614322 to bc35ab5
…doc_to_doc_details works
…tateful to __add__
Force-pushed from bc35ab5 to 1df5ece
This PR:

- Enables `DocDetails.doc_id` deduplication across files sharing the beginning (e.g. a main text and a supplemental information whose metadata is inferred to be identical)
- Adds a `content_hash` field on `Doc` (which we already collect during `Docs.aadd`)
- Updates `DocDetails.doc_id` to be a composite key of DOI and content hash
- Adds `test_duplicate` that does not work on current `main`
- Covers `DocDetails` metadata upgrades and `file_location`

Closes #1005